Stroke Prediction

Table of Contents

I. Project Overview
II. Exploratory Data Analysis
III. Data Preprocessing
IV. Training Models
V. Results and Conclusions
VI. References

Project Overview

A stroke is a medical condition that occurs when "blood flow to the brain is blocked. This prevents the brain from getting oxygen and nutrients from the blood. Without oxygen and nutrients, brain cells begin to die within minutes. Sudden bleeding in the brain can also cause a stroke if it damages brain cells" (National Heart, Lung, and Blood Institute, n.d.).
The main causes of stroke are high blood pressure, diabetes and smoking.
The purpose of this project is to predict whether a patient is at risk of stroke, using the dataset published by Federico Soriano (2021) on Kaggle.

Overview of the dataset, provided by the author:

1) id: unique identifier
2) gender: "Male", "Female" or "Other"
3) age: age of the patient
4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6) ever_married: "No" or "Yes"
7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8) Residence_type: "Rural" or "Urban"
9) avg_glucose_level: average glucose level in blood
10) bmi: body mass index
11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown" (which means that the information is unavailable for this patient)
12) stroke: 1 if the patient had a stroke or 0 if not

Exploratory Data Analysis

Only one column has NaN values.
Let us check the proportion of missing data:

About 3% of the values in the "bmi" column are missing.
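As a minimal sketch of this check, assuming the data is loaded into a pandas DataFrame, the share of missing values per column can be computed with `isna().mean()`. The small frame below is invented illustration data, not the Kaggle file:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the stroke dataset; the values are invented for illustration.
df = pd.DataFrame({
    "age": [67, 61, 80, 49],
    "bmi": [36.6, np.nan, 32.5, 34.4],
    "stroke": [1, 1, 1, 0],
})

# isna() marks missing cells; the mean of that boolean mask per column
# is the fraction of missing values in that column.
missing_share = df.isna().mean()
print(missing_share)
```

On the real dataset the same call would be applied to the full DataFrame read from the CSV file.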
Let us start with the general overview and findings.

Here we see some outliers, but we can say that strokes primarily affect older people.

We know that the blood sugar level test has the following results (Mayo Clinic, n.d.):

Body mass index can be interpreted as follows (World Health Organisation, n.d.):
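For illustration, the usual WHO adult cut-offs (18.5, 25 and 30 kg/m²) can be written as a small helper. The function name `bmi_category` is hypothetical and is not part of the original notebook:

```python
def bmi_category(bmi: float) -> str:
    """Map a BMI value (kg/m^2) to the standard WHO adult category."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "normal weight"
    if bmi < 30.0:
        return "overweight"
    return "obese"


# Example values chosen only to hit each category once.
for value in (17.0, 22.4, 27.9, 36.6):
    print(value, bmi_category(value))
```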

Other health factors:

We can say that the presence of both hypertension (1) and heart disease (1) is also a risk factor. Smokers, including former smokers, are more likely to have a stroke.

Other factors:

Based on these plots, we can say that both males and females suffer from strokes.
In these plots, unmarried people appear to be at risk. One possible explanation for the low share of strokes among self-employed people is that they experience less stress. The type of residence has no visible effect on the probability of having a stroke.

Data Preprocessing

  1. We need to deal with the NaN values in the BMI column. Because they make up only 3% of the BMI data, I fill the missing values with the column mean:
  1. I had some concerns about the "Unknown" values in the smoking status variable, but they account for almost 1/3 of the dataset. So, I decided to leave them as they are, in case we want to predict the probability of a stroke for someone whose smoking status is unknown. I dropped the single row in which the gender variable has the value "Other", because there is only one such observation in the dataset.
  1. Now, let us create dummy variables for the work_type and smoking_status variables:
  1. If we check the dataset, we can see that it is imbalanced with respect to the stroke variable:
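The four steps above can be sketched in pandas as follows. This is an illustration under stated assumptions rather than the original notebook code: the frame is a toy stand-in with invented values, and only the column names come from the dataset description.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dataset; all values are invented for illustration.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Other", "Female", "Male"],
    "work_type": ["Private", "Self-employed", "Private", "Govt_job", "children"],
    "smoking_status": ["smokes", "Unknown", "never smoked",
                       "formerly smoked", "never smoked"],
    "bmi": [30.0, np.nan, 22.0, 26.0, 18.0],
    "stroke": [1, 0, 0, 0, 0],
})

# Step 1: replace missing BMI values with the mean of the column.
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())

# Step 2: keep "Unknown" smoking status, but drop the "Other" gender row.
df = df[df["gender"] != "Other"].reset_index(drop=True)

# Step 3: one-hot encode the multi-valued categorical columns.
df = pd.get_dummies(df, columns=["work_type", "smoking_status"])

# Step 4: inspect the class balance of the target variable.
class_counts = df["stroke"].value_counts()
print(class_counts)
```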

The model trained on the default version of the dataset reaches 99% accuracy, but an f1-score of only 0.05 for predicting high risk of stroke, which is very poor. Therefore, I decided to apply either oversampling or downsampling to the dataset.
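To see why high accuracy and a low f1-score can coexist on imbalanced data, consider the following sketch. The confusion-matrix counts are hypothetical and chosen only for illustration; they are not the results of the model described here.

```python
# Hypothetical confusion-matrix counts for a rare positive class
# (invented numbers, not taken from the project's results).
tp, fn = 2, 48     # positives: 2 detected, 48 missed
fp, tn = 1, 949    # negatives: 1 false alarm, 949 correctly rejected

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Accuracy is dominated by the many correctly classified negatives, while the f1-score for the positive class is pulled down by the low recall.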
In this case, however, oversampling based on only 249 rows of stroke cases can cause overfitting, especially at higher oversampling rates, and decrease classifier performance (Branco, Torgo and Ribeiro, 2015).
So, I decided to undersample the dataset. Making the set perfectly balanced, however, can lead to worse prediction of no-stroke cases, so I undersampled the dataset to 1109 entries and still use techniques that are fairly robust when dealing with imbalanced data:
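A minimal sketch of such partial undersampling is shown below. It assumes that all 249 stroke rows are kept and that the no-stroke rows are randomly sampled down until the total reaches 1109; the text does not state how the 1109 rows were selected. The size of the majority class (4860) and the `feature` column are placeholders for illustration.

```python
import pandas as pd

N_STROKE = 249        # number of stroke rows, from the text
N_NO_STROKE = 4860    # assumption: size of the majority class
TARGET_TOTAL = 1109   # size of the undersampled dataset, from the text

# Synthetic frame with the assumed class sizes; "feature" is a placeholder.
df = pd.DataFrame({
    "feature": range(N_STROKE + N_NO_STROKE),
    "stroke": [1] * N_STROKE + [0] * N_NO_STROKE,
})

minority = df[df["stroke"] == 1]
majority = df[df["stroke"] == 0]

# Keep every stroke row and draw just enough no-stroke rows
# (without replacement) to reach the target size.
majority_down = majority.sample(n=TARGET_TOTAL - len(minority), random_state=42)

# Recombine the two classes and shuffle the rows.
balanced = (
    pd.concat([minority, majority_down])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
print(balanced["stroke"].value_counts())
```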

Training Models

Results and Conclusions

The K-Nearest Neighbors Classifier and the Random Forest Classifier show the highest f1-score and accuracy.
Because the f1-score for no-stroke cases is relatively high, we can say that downsampling did not leave the models with too little data.
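A comparison of this kind could be set up in scikit-learn roughly as follows. This is a sketch on synthetic data generated with `make_classification`, not the original training code: the hyperparameters, the train/test split and the class proportions (about 22% positives, similar to 249 of 1109) are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, moderately imbalanced data standing in for the real features.
X, y = make_classification(
    n_samples=1109, n_features=10, weights=[0.78, 0.22], random_state=42
)

# A stratified split keeps the class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Hyperparameters are assumptions chosen for illustration.
models = {
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # Report the f1-score for each class separately, plus overall accuracy.
    scores[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "f1_stroke": f1_score(y_test, pred, pos_label=1),
        "f1_no_stroke": f1_score(y_test, pred, pos_label=0),
    }
    print(name, scores[name])
```

On real data, distance-based models such as K-Nearest Neighbors usually benefit from scaling the numeric features first.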

In general, we can say that, first of all, the dataset can be improved. It is not only imbalanced, but also does not differentiate between the types of stroke: ischemic strokes, which are mostly caused by the blockage of a blood vessel, and hemorrhagic strokes, which can occur after head trauma (Caceres and Goldstein, 2012).



References:

1. Branco, P., Torgo, L. & Ribeiro, R. (2015). A Survey of Predictive Modelling under Imbalanced Distributions. Machine Learning. Retrieved from https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/
2. Caceres, J. A., & Goldstein, J. N. (2012). Intracranial hemorrhage. Emergency medicine clinics of North America, 30(3), 771–794. https://doi.org/10.1016/j.emc.2012.06.003
3. Mayo Clinic. (n.d.). Diabetes. Diseases & Conditions. Retrieved from https://www.mayoclinic.org/diseases-conditions/diabetes/diagnosis-treatment/drc-20371451
4. National Heart, Lung, and Blood Institute. (n.d.). Stroke. Retrieved from https://www.nhlbi.nih.gov/health-topics/stroke
5. Soriano, F. (2021). Stroke Prediction Dataset. Kaggle. Retrieved from https://www.kaggle.com/fedesoriano/stroke-prediction-dataset
6. World Health Organisation. (n.d.). Body mass index - BMI. A healthy lifestyle. Retrieved from https://www.euro.who.int/en/health-topics/disease-prevention/nutrition/a-healthy-lifestyle/body-mass-index-bmi